(54) Block encryption/decryption apparatus for Rijndael/AES



(57) The present invention concerns in particular the efficient implementation of encryption or
decryption rounds of data encryption algorithms, particularly the Rijndael Block Cipher. The
invention provides an apparatus for encrypting or decrypting a data block, the apparatus comprising
a transformation module and a plurality of shift registers each comprising a sequence of data
registers through which data components are shifted in successive operational cycles. At least some
of the data registers are associated with a respective selector switch, the setting of which
determines whether the associated data register is loaded with a data component from a data register
in its respective shift register or with a transformed data component corresponding to its
respective shift register. The provision of shift registers and switches affords a significant
saving in circuit area. Further, the invention requires a relatively low number of switches (e.g.
multiplexers) in the computational data paths and this allows a relatively high throughput to be
achieved.
Description

FIELD OF THE INVENTION 

[0001] The present invention relates to the field of data encryption. The invention relates particularly to improvements
in the scheduling of data in a data encryption or decryption apparatus.

BACKGROUND TO THE INVENTION 

[0002] Secure or private communication, particularly over a telephone network or a computer network, is dependent
on the encryption, or enciphering, of the data to be transmitted. One type of data encryption, commonly known as
private key encryption or symmetric key encryption, involves the use of a key, normally in the form of a pseudo-random
number, or code, to encrypt data in accordance with a selected data encryption algorithm (DEA). To decipher the
encrypted data, a receiver must know and use the same key in conjunction with the inverse of the selected encryption
algorithm. Thus, anyone who receives or intercepts an encrypted message cannot decipher it without knowing the key.

[0003] Data encryption is used in a wide range of applications including IPSec Protocols, ATM Cell Encryption, Secure
Socket Layer (SSL) protocol and Access Systems for Terrestrial Broadcast.

[0004] In September 1997 the National Institute of Standards and Technology (NIST) issued a request for candidates
for a new Advanced Encryption Standard (AES) to replace the existing Data Encryption Standard (DES). A data
encryption algorithm commonly known as the Rijndael Block Cipher was selected for the new AES.

[0005] The present invention concerns in particular the efficient implementation of encryption or decryption rounds
of data encryption algorithms, particularly the Rijndael Block Cipher.

Summary of the Invention 


[0006] A first aspect of the invention provides an apparatus for encrypting or decrypting a data block comprising a
plurality of data components over a plurality of operational cycles, the apparatus comprising a transformation module
arranged to perform one or more encryption or decryption operations in each operational cycle; and a plurality of shift
registers each comprising a sequence of data registers through which data components are shifted in successive
operational cycles, the transformation module being arranged to receive a respective data component from a respective
data register from each shift [...] transformed data components, wherein at least some of said data registers are
associated with a respective selector switch, the setting of which selector switch in each operational cycle determines
whether the associated data register is loaded with a data component from a data register in its respective shift
register or with the transformed data component corresponding to its respective shift register in said operational cycle.

[0007] The provision of shift registers and switches in accordance with the invention affords a significant saving in
circuit area. Further, the invention requires a relatively low number of switches (e.g. multiplexers) in the computational
data paths and this allows a relatively high throughput to be achieved.

[0008] Preferably, the apparatus is arranged to perform encryption or decryption in accordance with the Rijndael
cipher. More preferably, the transformation module is arranged to perform, in whole or in part, a Rijndael encryption or
decryption round. Preferably, the apparatus is arranged to operate [...] each component comprising one data byte,
wherein each shift register comprises four one-byte data registers. More preferably, the transformat[...]
Preferably, each switch comprises a 2-to-1 selector switch.

[0009] In a preferred embodiment, the apparatus comprises an apparatus for performing encryption in accordance
with the Rijndael cipher. In an alternative embodiment, the apparatus co[...] in accordance with the Rijndael cipher.

[0010] A second aspect of the invention provides a method of encrypting or decrypting a data block, comprising a
plurality of data components, over a plurality of operational cycles, the method comprising: loading the data components
into a respective data register, each data register being one of a sequence of data registers in one of a plurality of shift
registers; and in respect of each operational cycle, causing a data component from one data register of each shift
register to undergo one or more data encryption or decryption operations to produce a corresponding transformed data
component; and setting at least one selector switch to determine whether an associated data register is loaded with a
data component from a data register in its respective shift register or with the transformed data component
corresponding to its respective shift register.

[0011] A third aspect of the invention provides a computer program product comprising computer usable instructions
for generating an apparatus according to the first aspect of the invention.

[0012] The apparatus of the invention may be implemented in a number of conventional ways, for example as an
Application Specific Integrated Circuit (ASIC) or a Field Programmable Gate Array (FPGA). The implementation
process may also be one of many conventional design methods including standard cell design or schematic entry/layout
synthesis. Alternatively, the apparatus may be described, or defined, using a hardware description language (HDL) such
as VHDL, Verilog HDL or a targeted netlist format (e.g. xnf, EDIF or the like) recorded in an electronic file, or computer
useable file.

[0013] Thus, the invention further provides a computer program, or computer program product, comprising program
instructions, or computer usable instructions, arranged to generate, in whole or in part, an apparatus according to the
first aspect of the invention. The apparatus may therefore be implemented as a set of suitable such computer programs.
Typically, the computer program comprises computer usable statements or instructions written in a hardware
description, or definition, language (HDL) such as VHDL, Verilog HDL or a targeted netlist format (e.g. xnf, EDIF or the like)
and recorded in an electronic or computer usable file which, when synthesised on appropriate hardware synthesis
tools, generates semiconductor chip data, such as mask definitions or other chip design information, for generating a
semiconductor chip. The invention also provides said computer program stored on a computer useable medium. The
invention further provides semiconductor chip data, stored on a computer usable medium, arranged to generate, in
whole or in part, an apparatus according to the invention.

[0014] Other advantageous aspects of the invention will be apparent to those ordinarily skilled in the art upon review
of the following description of specific embodiments and with reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS 

[0015] Embodiments of the invention are now described by way of example and with reference to the accompanying
drawings in which:

Figure 1a is a representation of data bytes arranged in a State rectangular array;

Figure 1b is a representation of a cipher key arranged in a rectangular array;

Figure 1c is a representation of an expanded key schedule;

Figure 2 is a schematic illustration of the Rijndael Block Cipher;

Figure 3 is a schematic illustration of a normal Rijndael Round;

Figure 4 is a schematic representation of a data encryption apparatus arranged in accordance with the invention;

Figure 5 is a schematic representation of a typical round transform operation;

Figures 6a to 6e illustrate in schematic form an encryption round module comprising a data scheduling apparatus
arranged in accordance with the invention;

Figure 7 is a schematic representation of a data decryption apparatus arranged in accordance with the invention;
and

Figures 8a to 8e illustrate in schematic form a decryption round module comprising a data scheduling apparatus
arranged in accordance with the invention.

DETAILED DESCRIPTION OF THE DRAWINGS 

[0016] The Rijndael algorithm is a private key, or symmetric key, DEA and is an iterated block cipher. The Rijndael
algorithm (hereinafter 'Rijndael') is defined in the publication "The Rijndael Block Cipher: AES proposal" by J. Daemen
and V. Rijmen presented at the First AES Candidate Conference (AES1) of August 20-22, 1998, the contents of which
publication are hereby incorporated herein by way of reference.

[0017] In accordance with many private key DEAs, including Rijndael, encryption is performed in multiple stages,
commonly known as iterations, or rounds. Each round uses a respective sub-key, or round key, to perform its encryption
operation. The round keys are derived from a primary key, or cipher key.

[0018] The data to be encrypted, sometimes known as plaintext, is divided into blocks for processing. Similarly, data
to be decrypted is processed in blocks. With Rijndael, the data block length and cipher key length can be 128, 192 or
256 bits. The NIST requested that the AES must implement a symmetric block cipher with a block size of 128 bits,
hence the variations of Rijndael which can operate on larger block sizes do not form part of the standard itself. Rijndael
also has a variable number of rounds, namely 10, 12 and 14 when the cipher key lengths are 128, 192 and 256 bits
respectively.

[0019] With reference to Figure 1a, the transformations performed during the Rijndael encryption operations consider
a data block as a 4-column rectangular array, or State (generally indicated at 10 in Figure 1a), of 4-byte vectors 12.
For example, a 128-bit plaintext (i.e. unencrypted) data block consists of 16 bytes, B0, B1, B2, B3, B4 ... B14, B15. Hence,
in the State 10, B0 becomes P0,0, B1 becomes P1,0, B2 becomes P2,0 ... B4 becomes P0,1 and so on. Figure 1a shows
the State 10 for the standards-compliant 128-bit data block length. For data block lengths of 192 bits or 256 bits, the
State 10 comprises 6 and 8 columns of 4-byte vectors respectively.
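The column-by-column mapping of block bytes into the State can be sketched as follows. This is a minimal illustration, assuming the State is held as a list of four rows and that byte B(r + 4c) lands at row r, column c; the function names `block_to_state` and `state_to_block` are hypothetical and do not appear in the specification.

```python
def block_to_state(block: bytes, rows: int = 4) -> list[list[int]]:
    """Arrange a data block column by column into a State array.

    state[r][c] holds byte B(r + rows*c), so B0..B3 fill column 0,
    B4..B7 fill column 1, and so on.
    """
    if len(block) % rows != 0:
        raise ValueError("block length must be a multiple of the row count")
    cols = len(block) // rows
    return [[block[r + rows * c] for c in range(cols)] for r in range(rows)]


def state_to_block(state: list[list[int]]) -> bytes:
    """Inverse of block_to_state: read the State back out column by column."""
    rows, cols = len(state), len(state[0])
    return bytes(state[r][c] for c in range(cols) for r in range(rows))


# A 128-bit block whose byte values equal their own indices B0..B15.
state = block_to_state(bytes(range(16)))
print(state[0])  # row 0 of the State
```

With a 24-byte or 32-byte block the same function yields the 6-column and 8-column States mentioned above.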
[0020] With reference to Figure 1b, the cipher key is also considered to be a multi-column rectangular array 14 of
4-byte vectors 16, the number of columns depending on the cipher key length. In Figure 1b, the vectors 16 headed
by bytes K0,4 and K0,5 are present when the cipher key length is 192 bits or 256 bits, while the vectors 16 headed by
bytes K0,6 and K0,7 are only present when the cipher key length is 256 bits.

[0021] Referring now to Figure 2, there is shown, generally indicated at 20, a schematic representation of Rijndael.

[0022] The algorithm design consists of an initial data/key addition operation 22, in which a plaintext data block is
added to the cipher key, followed by nine, eleven or thirteen rounds 24 when the key length is 128 bits, 192 bits or
256 bits respectively and a final round 26, which is a variation of the typical round 24. There is also a key schedule
operation 28 for expanding the cipher key in order to produce a respective different round key for each round 24, 26.
[0023] Figure 3 illustrates the typical Rijndael round 24. The round 24 comprises a ByteSub transformation 30, a
ShiftRow transformation 32, a MixColumn transformation 34 and a Round Key Addition 36. The ByteSub transformation
30, which is also known as the s-box of the Rijndael algorithm, operates on each byte in the State 10 independently.

[0024] The s-box 30 involves finding the multiplicative inverse of each byte in the finite, or Galois, field GF(2^8). An
affine transformation is then applied, which involves multiplying the result of the multiplicative inverse by a matrix M
(as defined in the Rijndael specification) and adding to the hexadecimal number '63' (as is stipulated in the Rijndael
specification).
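The two steps of the s-box can be sketched in software as below. This is only an illustrative model, not the hardware of the invention: it assumes the field polynomial x^8 + x^4 + x^3 + x + 1 (0x11B) and the bit-level form of the affine matrix M from the Rijndael specification, neither of which is spelled out in this text, and the function names are hypothetical.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:      # reduce as soon as the degree reaches 8
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse in GF(2^8); 0 is mapped to 0 by convention."""
    if a == 0:
        return 0
    # a^254 equals a^-1 because the multiplicative group has order 255.
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def byte_sub(a: int) -> int:
    """ByteSub for one byte: field inverse, then the affine transformation."""
    x = gf_inv(a)
    y = 0
    for i in range(8):
        # Row i of the affine matrix combines bits i, i+4, i+5, i+6, i+7 (mod 8).
        bit = ((x >> i) ^ (x >> ((i + 4) % 8)) ^ (x >> ((i + 5) % 8))
               ^ (x >> ((i + 6) % 8)) ^ (x >> ((i + 7) % 8))) & 1
        y |= bit << i
    return y ^ 0x63        # add the constant '63'


print(hex(byte_sub(0x00)))
```

In hardware the same mapping is commonly held as a 256-entry look-up table, which is how the later paragraphs refer to it.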

[0025] In the ShiftRow transformation 32, the rows of the State 10 are cyclically shifted to the left. Row 0 is not shifted,
row 1 is shifted 1 place, row 2 by 2 places and row 3 by 3 places.
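As a short illustration of the row shifts just described, the sketch below applies ShiftRow to a 4-column State held as a list of rows. The representation and the name `shift_row` are assumptions made for the example.

```python
def shift_row(state: list[list[int]]) -> list[list[int]]:
    """Cyclically shift row r of the State to the left by r places."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


# State whose entries are the byte indices B0..B15, laid out column by column.
state = [
    [0, 4, 8, 12],
    [1, 5, 9, 13],
    [2, 6, 10, 14],
    [3, 7, 11, 15],
]
shifted = shift_row(state)
for row in shifted:
    print(row)
```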

[0026] The MixColumn transformation 34 operates on the columns of the State 10. Each column, or 4-byte vector
12, is considered a polynomial over GF(2^8) and multiplied modulo x^4 + 1 with a fixed polynomial c(x), where,

$$ c(x) = \text{'03'}\,x^{3} + \text{'01'}\,x^{2} + \text{'01'}\,x + \text{'02'} \qquad (1) $$

(the inverted commas surrounding the polynomial coefficients signifying that the coefficients are given in hexadecimal).
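[0027] The MixCol transformation 34 operates on each column (Col 0 to Col 3) of the State 10. Each column is
considered a polynomial over GF(2^8) and multiplied modulo x^4 + 1 with a fixed polynomial c(x) as set out in equation [1] for
encryption and equation [2] below for decryption. This can be considered as a matrix multiplication as follows:

During encryption:

$$ \begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix} =
\begin{bmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{bmatrix}
\begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{bmatrix} \qquad [3] $$

During decryption:

$$ \begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix} =
\begin{bmatrix} 0E & 0B & 0D & 09 \\ 09 & 0E & 0B & 0D \\ 0D & 09 & 0E & 0B \\ 0B & 0D & 09 & 0E \end{bmatrix}
\begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{bmatrix} \qquad [4] $$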



[0028] Where the input to the MixCol transformation 34 may be denoted in State format as follows:

        Col 0   Col 1   Col 2   Col 3
Row 0   a0      a4      a8      a12
Row 1   a1      a5      a9      a13
Row 2   a2      a6      a10     a14
Row 3   a3      a7      a11     a15

[0029] And the output may be denoted in State format as:

        Col 0   Col 1   Col 2   Col 3
Row 0   b0      b4      b8      b12
Row 1   b1      b5      b9      b13
Row 2   b2      b6      b10     b14
Row 3   b3      b7      b11     b15

[0030] Equations [3] and [4] illustrate the matrix multiplication for the first column [a0-a3] of the input State to produce
the first column [b0-b3] of the output State. The MixCol transformation performs the same multiplication for the remaining
columns of the input State to produce corresponding output State columns. The values given in the multiplication
matrices in [3] and [4] correspond respectively with the coefficients of the fixed polynomial c(x) given in equations [1] and
[2]. These values are specific to the Rijndael algorithm.
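The encryption-side matrix multiplication for one column can be sketched as follows. The sketch assumes the standard Rijndael coefficient matrix (rows being rotations of 02 03 01 01, consistent with c(x) in equation (1)) and the Rijndael field polynomial 0x11B; the names `gf_mul` and `mix_column` are illustrative only.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


# Coefficient matrix for encryption; each row is a rotation of the previous one.
MIX = [
    [0x02, 0x03, 0x01, 0x01],
    [0x01, 0x02, 0x03, 0x01],
    [0x01, 0x01, 0x02, 0x03],
    [0x03, 0x01, 0x01, 0x02],
]


def mix_column(col: list[int]) -> list[int]:
    """Transform one 4-byte column [a0..a3] into [b0..b3]."""
    out = []
    for row in MIX:
        acc = 0
        for coeff, byte in zip(row, col):
            acc ^= gf_mul(coeff, byte)   # addition in GF(2^8) is XOR
        out.append(acc)
    return out


print([hex(b) for b in mix_column([0xDB, 0x13, 0x53, 0x45])])
```

Applying `mix_column` to each of the four columns in turn gives the full MixCol transformation of a 128-bit State.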

[0031] Finally, in Round Key Addition 36, the State 10 bytes and the round key bytes are added by a bitwise XOR
operation.

[0032] In the final round 26, the MixColumn transformation 34 is omitted.

[0033] The ByteSub, ShiftRow, and MixCol transformations are well documented in the Rijndael specification and
there are a number of conventional ways in which they each may be implemented.

[0034] The Rijndael key schedule 28 consists of two parts: Key Expansion and Round Key Selection. Key Expansion
involves expanding the cipher key into an expanded key, namely a linear array 15 (Fig. 1c) of 4-byte vectors or words
17, the length of the array 15 being determined by the data block length, Nb (in words), multiplied by the number of
rounds, Nr, plus 1, i.e. array length = Nb(Nr + 1). In standards-compliant Rijndael, the data block length is four words,
Nb = 4. When the key block length, Nk = 4, 6 and 8, the number of rounds is 10, 12 and 14 respectively. Hence the
lengths of the expanded key are as shown in Table 1 below.



Table 1. Length of Expanded Key for Varying Key Sizes

Data Block Length, Nb      4     4     4
Key Block Length, Nk       4     6     8
Number of Rounds, Nr      10    12    14
Expanded Key Length       44    52    60
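The entries of Table 1 follow directly from the relation array length = Nb(Nr + 1). A minimal check, using a hypothetical helper name and the round counts stated in the text:

```python
# Number of rounds Nr for each key block length Nk (in words), as stated above.
ROUNDS_FOR_NK = {4: 10, 6: 12, 8: 14}


def expanded_key_words(nk: int, nb: int = 4) -> int:
    """Length of the expanded key in 4-byte words: Nb * (Nr + 1)."""
    return nb * (ROUNDS_FOR_NK[nk] + 1)


for nk in (4, 6, 8):
    print(nk, expanded_key_words(nk))
```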
[0035] The first Nk words of the expanded key comprise the cipher key. When Nk = 4 or 6, each subsequent word,
W[i], is found by XORing the previous word, W[i-1], with the word Nk positions earlier, W[i-Nk]. For words 17 in positions
which are a multiple of Nk, a transformation is applied to W[i-1] before it is XORed. This transformation involves a cyclic
shift of the bytes in the word 17. Each byte is passed through the Rijndael s-box 30 and the resulting word is XORed
with a round constant stipulated by Rijndael (see Rcon(i) function described below). However, when Nk = 8, an additional
transformation is applied: for words 17 in positions of the form (Nk·j) + 4 (that is, positions 4 beyond a multiple of Nk),
each byte of the word, W[i-1], is passed through the Rijndael s-box 30.
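The expansion rule can be sketched in software as below. This is an illustrative model only: the s-box is generated from the field inverse and affine map of the Rijndael specification, the round constant is assumed to be the word (x^(j-1), 0, 0, 0) as in that specification, words are held as big-endian 32-bit integers, and all function names are hypothetical.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo 0x11B."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def byte_sub(a: int) -> int:
    """S-box for one byte: field inverse (a^254), then the affine map."""
    x = 0
    if a:
        x = 1
        for _ in range(254):
            x = gf_mul(x, a)
    y = 0
    for i in range(8):
        bit = ((x >> i) ^ (x >> ((i + 4) % 8)) ^ (x >> ((i + 5) % 8))
               ^ (x >> ((i + 6) % 8)) ^ (x >> ((i + 7) % 8))) & 1
        y |= bit << i
    return y ^ 0x63


SBOX = [byte_sub(a) for a in range(256)]


def sub_word(w: int) -> int:
    """Pass each byte of a 32-bit word through the s-box."""
    return int.from_bytes(bytes(SBOX[b] for b in w.to_bytes(4, "big")), "big")


def rot_word(w: int) -> int:
    """Cyclic shift of the bytes in a word, one byte to the left."""
    return ((w << 8) & 0xFFFFFFFF) | (w >> 24)


def rcon(j: int) -> int:
    """Round constant word for j >= 1 (assumed form: x^(j-1) in the top byte)."""
    c = 1
    for _ in range(j - 1):
        c = gf_mul(c, 2)
    return c << 24


def expand_key(key: bytes, nb: int = 4) -> list[int]:
    """Expand a 16-, 24- or 32-byte cipher key into Nb*(Nr+1) words."""
    nk = len(key) // 4
    nr = {4: 10, 6: 12, 8: 14}[nk]
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(nk)]
    for i in range(nk, nb * (nr + 1)):
        temp = w[i - 1]
        if i % nk == 0:
            temp = sub_word(rot_word(temp)) ^ rcon(i // nk)
        elif nk == 8 and i % nk == 4:
            temp = sub_word(temp)        # extra step for 256-bit keys only
        w.append(w[i - nk] ^ temp)
    return w


words = expand_key(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
print(len(words), hex(words[4]))
```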

[0036] The round keys are selected from the expanded key 15. In a design with Nr rounds, Nr + 1 round keys are
required. For example, a 10-round design requires 11 round keys. Round key 0 comprises words W[0] to W[3] of the expanded
key 15 (i.e. round key 0 corresponds with the cipher key itself) and is utilised in the initial data/key addition 22, round
key 1 comprises W[4] to W[7] and is used in round 0, round key 2 comprises W[8] to W[11] and is used in round 1 and
so on. Finally, round key 10 is used in the final round 26.

[0037] The decryption process in Rijndael is effectively the inverse of its encryption process. Decryption comprises
an inverse of the final round 26, inverses of the rounds 24, followed by the initial data/key addition 22. The data/key
addition 22 remains the same as it involves an XOR operation, which is its own inverse. The inverse of the round 24,
26 is found by inverting each of the transformations in the round 24, 26. The inverse of ByteSub 30 is obtained by
applying the inverse of the affine transformation and taking the multiplicative inverse in GF(2^8) of the result. In the
inverse of the ShiftRow transformation 32, row 0 is not shifted, row 1 is now shifted 3 places, row 2 by 2 places and
row 3 by 1 place. The polynomial, c(x), used to transform the State 10 columns in the inverse of MixColumn 34 is given
by,

$$ c(x) = \text{'0B'}\,x^{3} + \text{'0D'}\,x^{2} + \text{'09'}\,x + \text{'0E'} \qquad (2) $$
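The inverse ShiftRow and inverse MixColumn steps can be sketched as follows. The sketch assumes the State is a list of four rows, that the inverse matrix has rows that are rotations of 0E 0B 0D 09 (consistent with the coefficients of equation (2)), and the field polynomial 0x11B; the function names are illustrative only.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


# Coefficient matrix for decryption; each row is a rotation of the previous one.
INV_MIX = [
    [0x0E, 0x0B, 0x0D, 0x09],
    [0x09, 0x0E, 0x0B, 0x0D],
    [0x0D, 0x09, 0x0E, 0x0B],
    [0x0B, 0x0D, 0x09, 0x0E],
]


def inv_mix_column(col: list[int]) -> list[int]:
    """Undo MixColumn on one 4-byte column."""
    out = []
    for row in INV_MIX:
        acc = 0
        for coeff, byte in zip(row, col):
            acc ^= gf_mul(coeff, byte)
        out.append(acc)
    return out


def inv_shift_row(state: list[list[int]]) -> list[list[int]]:
    """Undo ShiftRow: row r is shifted left by (4 - r) mod 4 places."""
    result = []
    for r, row in enumerate(state):
        k = (len(row) - r) % len(row)
        result.append(row[k:] + row[:k])
    return result


print([hex(b) for b in inv_mix_column([0x8E, 0x4D, 0xA1, 0xBC])])
```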

[0038] Similarly to the data/key addition 22, Round Key Addition 36 is its own inverse. During decryption, the key
schedule 28 does not change; however, the round keys constructed for encryption are now used in reverse order. For
example, in a 10-round design, round key 0 is still utilized in the initial data/key addition 22 and round key 10 in the
final round 26. However, round key 1 is now used in round 8, round key 2 in round 7 and so on.
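The round key assignments described for a 10-round design can be tabulated with a short sketch. The round numbering (normal rounds 0 to 8) follows the examples given in the text; the function names are hypothetical and the generalisation to other round counts is an assumption.

```python
def encryption_key_order(nr: int = 10) -> list[tuple[str, int]]:
    """(stage, round key index) pairs in the order used during encryption."""
    order = [("initial addition", 0)]
    order += [(f"round {r}", r + 1) for r in range(nr - 1)]
    order.append(("final round", nr))
    return order


def decryption_key_index(r: int, nr: int = 10) -> int:
    """Round key used by normal round r during decryption.

    Follows the example in the text: round key 1 goes to round 8,
    round key 2 to round 7, and so on.
    """
    return (nr - 1) - r


for stage, key_index in encryption_key_order():
    print(stage, "uses round key", key_index)
```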
[0039] A number of different architectures can be considered when designing an apparatus or circuit for implementing
encryption algorithms. These include Iterative Looping (IL), where only one data processing module is used to
implement all of the rounds. Hence for an n-round algorithm, n iterations of that round are carried out to perform an encryption,
data being passed through the single instance of the data processing module n times. Loop Unrolling (LU) involves the
unrolling of multiple rounds. Pipelining (P) is achieved by replicating the round, i.e. devising one data processing module
for implementing the round and using multiple instances of the [...]
Sub-Pipelining (SP) may be carried out on a partially pipelined design when the round is complex. It decreases the
pipeline's delay between stages but increases the number of clock cycles required to perform an encryption. The
present invention relates particularly to Iterative Loop architecture implementations.

[0040] Figure 4 shows, in schematic form, a data encryption apparatus generally indicated at 40. The apparatus 40
is arranged to receive a plaintext input data block (shown as "plaintext" in Figure 4) and a cipher key (shown as "key"
in Figure 4) and to produce, after a number of encryption rounds, an encrypted data block (shown as "ciphertext" in
Figure 4).

[0041] The apparatus 40 comprises a data/key addition module 48 for performing the data/key addition operation
22 (Figure 2). The data/key addition module 48 comprises an XOR component (not shown) arranged to perform a
bitwise XOR operation of each byte Bi of the State 10 comprising the input plaintext, with a respective byte Ki of the
cipher key.
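In software terms the data/key addition is a byte-wise XOR. A minimal sketch, assuming a 128-bit block and a 128-bit key of equal length; `add_key` is a hypothetical name:

```python
def add_key(block: bytes, key: bytes) -> bytes:
    """XOR each byte Bi of the block with the corresponding key byte Ki."""
    if len(block) != len(key):
        raise ValueError("block and key must have the same length")
    return bytes(b ^ k for b, k in zip(block, key))


plaintext = bytes.fromhex("3243f6a8885a308d313198a2e0370734")
cipher_key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")
added = add_key(plaintext, cipher_key)
print(added.hex())
```

Because XOR is its own inverse, applying `add_key` a second time with the same key restores the original block, which is why the same operation serves for decryption.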

[0042] The apparatus 40 further includes a data processing module in the form of a round module 44 for implementing
the normal encryption rounds 24. The round module 44 comprises a round transformation module 156 and a data
scheduling apparatus 100 according to the invention, each of which is described in more detail hereinafter. In the
illustrated example, the data block length Nb is assumed to be 128 bits. The data/key addition module 48 provides, to
the apparatus 100, the result of the data/key addition operation which, in this example, comprises 128 bits of data. As
is described in more detail below, this data is loaded into a plurality of data registers (not shown in Figure 4) within the
apparatus 100 and then supplied, 32 bits at a time (4 bytes in parallel, see Figure 4), to the transformation module
156. The transformation module 156 is arranged to perform encryption operations on the received data and to produce
output data which, in the present example, comprises 32 bits (4 bytes in parallel as shown in Figure 4). The output
data of the transformation module 156 is supplied to the scheduling apparatus 100 whereupon the data is loaded into
registers within the apparatus 100. The scheduling apparatus 100 is arranged, in accordance with the invention, to
control the sending and receiving of data to and from the transformation module 156 in order to correctly implement
the encryption algorithm. In the preferred embodiment the scheduling apparatus 100 is arranged to implement, in
particular, the ShiftRow operation of Rijndael.

[0043] The apparatus 40 also includes a key scheduler 50 for generating sub-keys from the cipher key. The key
scheduler 50 is arranged to provide the sub-keys to the transformation module 156 as required. The key scheduler 50 may
be implemented in a number of conventional ways and is preferably arranged to supply the transformation module 156
with the appropriate 32 bits of a respective sub-key in each clock cycle.

[0044] The preferred embodiment of the apparatus 40 further includes a final round module 46 arranged to implement
the Rijndael final round 26 in conventional manner. Once the round module 44 has finished performing the required
normal encryption rounds 24, the resulting partially encrypted data is provided to the final round module 46. Preferably,
the final round module 46 is arranged to operate on data 32 bits at a time so that the resulting ciphertext is produced
over four clock cycles.

[0045] As is described in more detail below, the transformation module 156 operates on a portion of a State data
array at a time (in this example one quarter of the State array, namely 32 bits out of 128 bits) and so each encryption
round takes a plurality of cycles to complete (four cycles in the present example). Once all of the required encryption
rounds are completed, the values contained in the registers within the scheduling apparatus 100 comprise the
ciphertext.

[0046] The present invention concerns in particular the efficient implementation of the encryption or decryption rounds
24. While the invention is particularly suited to, and is described herein in the context of, implementation of Rijndael,
a skilled person will appreciate that the invention may be used advantageously in the implementation of other data
encryption/decryption algorithms of similar structure to Rijndael.

[0047] One way to reduce the amount of resources required to implement a round 24, 26 is to operate on only a part
of the State 10 at a time using a given resource and then to process the remaining parts of the State 10 one after the
other using the same resource. For example, for the 4-column State 10 depicted in Figure 1a, the data may be operated
on column-by-column, i.e. only 32 bits of the 128-bit input State 10 are operated on at any one time. In the present
example, this means that each round is performed in 4 clock cycles (since there are 4 columns). This reduces the
required resources, e.g. hardware gate count, by approximately 75% for one round transform.
[0048] Figure 5 shows a schematic view of how a round 24, 26 may be implemented on a column-by-column basis.
The example of Figure 5 assumes that the operand 52 is a 128-bit State array, i.e. 16 bytes of data arranged in four
columns of 4-byte vectors 12. The operand 52 is supplied to a bank 54 of switches, or multiplexers, which are arranged
to perform the ShiftRow transformation 32. Typically, the bank 54 comprises a plurality of multiplexers in parallel. In
the present example, the bank 54 comprises four 4-to-1 byte multiplexers (not shown), each multiplexer being arranged
to select one byte from a respective row of the operand 52 in accordance with the ShiftRow transformation 32. The
output of the bank 54 comprises the four bytes selected by the respective multiplexers. This output is supplied to a
transform module 56 that is arranged to implement the ByteSub transformation 30, the MixCol transformation 34 and
the Key Addition operation 36 - these transformations/operations may be performed in any convenient conventional
manner. In the arrangement shown in Figure 5, the transform module 56 operates [...]
with the MixCol transformation [...]
performed on one byte at a time and so the transform module 56 preferably includes four instances of the resources
(e.g. Look-Up Tables (LUTs)) required to implement the ByteSub transformation 30. The output of the transform module
56 comprises four bytes of data corresponding to one column or vector 12 of a result 58, the result 58 taking the form
of a four-column State array. Thus, in four successive clock cycles the whole 16-byte result 58 is produced. Hence, in
each clock cycle, the bank 54 and the transform module 56 perform a quarter of the round transforms, i.e. they perform
the required round transforms on one quarter of the input operand 52 to produce one quarter of the result 58. The
arrow A in Figure 5 is used to indicate that the result 58 of one round is used as the input operand 52 of the next round.
[0049] In Figure 5, for illustrative purposes, each byte of the operand 52 and result 58 is labelled to show how the
bank 54 of multiplexers selects bytes from each row of the operand 52 in order to implement the ShiftRow transformation
32. The label of each byte includes a suffix A, B, C or D indicating in which row of the State 10 the byte appears: A
denotes the first row, B denotes the second row, and so on. Each label also includes a numeral 1, 2, 3 or 4 to differentiate
between column positions in the State 10. The labels of the bytes in the result 58 are given in parentheses () to distinguish
them from the bytes of the input operand 52.

[0050] It is assumed that the bank 54 of multiplexers and the transform module 56 operate on input bytes 1A, 1B,
1C and 1D in the first cycle to produce output bytes (1A), (1B), (1C) and (1D). In the second cycle, 2A, 2B, 2C and 2D
are operated on to produce (2A), (2B), (2C) and (2D). In the third cycle, 3A, 3B, 3C and 3D are operated on to produce
(3A), (3B), (3C) and (3D). In the fourth cycle, 4A, 4B, 4C and 4D are operated on to produce (4A), (4B), (4C) and (4D).
The arrangement of the labels 1A to 4D in Figure 5 shows how the multiplexers in the bank 54 are required to select
bytes from the respective rows of the operand 52 in order to implement the ShiftRow transformation 32. For example,
in the first cycle, the multiplexer associated with the first row of the operand 52 selects the byte from the first column
of that row, i.e. byte 1A, while the multiplexer associated with the second row of the operand 52 selects the byte from
the second column of that row, i.e. byte 1B, and so on. In the second cycle, the multiplexer associated with the first
row of the operand 52 selects the byte from the second column of that row, i.e. byte 2A, while the multiplexer associated
with, say, the fourth row of the operand 52 selects the byte from the first column of that row, i.e. byte 2D, and so on.
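The selection pattern of the four multiplexers can be expressed compactly. The sketch below assumes 0-based row, column and cycle indices and a 4-column operand; the relation column = (cycle + row) mod 4 is inferred from the examples in the text, and the function names are hypothetical.

```python
ROW_LETTERS = "ABCD"


def selected_column(row: int, cycle: int, columns: int = 4) -> int:
    """Column (0-based) chosen by the multiplexer for `row` in 0-based `cycle`."""
    return (cycle + row) % columns


def operand_label(row: int, column: int, columns: int = 4) -> str:
    """Figure 5 style label of the operand byte at (row, column), e.g. '1B'."""
    cycle = (column - row) % columns
    return f"{cycle + 1}{ROW_LETTERS[row]}"


# Print the label layout of the operand, one State row per line.
for r in range(4):
    print(" ".join(operand_label(r, c) for c in range(4)))
```

Each row's multiplexer steps through all four columns over the four cycles, which is why a 4-to-1 multiplexer per row is needed in this arrangement.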

[0051] In the arrangement of Figure 5, the bank 54 comprises four 4-to-1 byte multiplexers. This is considered to be
costly in terms of area. It is also considered to be desirable to have relatively few multiplexers in the computational
data path as multiplexers have the effect of reducing throughput.

[0052] Figures 6a to 6e illustrate the scheduling apparatus 100 for implementing a data encryption round according
to one aspect of the invention. The round transformation module 156 is also shown in Figures 6a to 6e.

[0053] The apparatus 100 comprises a plurality of data registers 160, one register in respect of each component of
the data block, or operand 52, upon which the transformation module 156 is required to operate. In the present example,
the data block components comprise bytes and the operand 52 comprises 16 bytes. Hence, in Figures 6a to 6e, the
apparatus 100 comprises 16 byte data registers 160. The data registers 160 are arranged as a plurality of shift registers,
one for each row of the data block (State array), each shift register comprising a sequence of data registers 160.
Preferably, the registers 160 are implemented as four four-byte shift registers, each shift register implementing a
respective row (Row 0, Row 1, Row 2 and Row 3) of four registers 160. Hence each register 160 comprises a respective
1-byte storage location, or register, within one of the four-byte shift registers.
The apparatus 100 preferably includes a further data register 161 which serves to delay the shifting of data in the last
row (Row 3) of registers 160 as is described in more detail below. The transformation module 156 comprises apparatus
(not shown) for performing the required encryption/decryption operations, as described in relation to the transformation
module 56 of Figure 5.

[0054] The apparatus 100 further comprises a plurality of 2-to-1 selector switches in the form of 2-to-1 multiplexers
(or MUXes) 162 which, in Figures 6a to 6e, are labelled M1, M2, M3, M4 and M5.

[0055] The apparatus 100 performs the required round transformations in four successive operational cycles, or
clock cycles, the transformation module 156 operating on one quarter of the input operand in each clock cycle. The
transformation module 156, the data registers 160 and the 2-to-1 multiplexers are all synchronised to a common clock
signal (not illustrated). After each cycle, outputs 164, 166, 168, 170 of the transformation module 156 (which carry
respective transformed data bytes) are fed back into the array of registers 160 as shown in Figures 6a to 6e. The 2-to-1
multiplexers 162 are controlled to load the registers 160 either from the outputs 164-170 of the transformation module
156 or from a data register 160 in the same row, or shift register. The arrangement is such that the registers 160 are
loaded over successive clock cycles with the particular bytes illustrated in Figures 6a to 6e.
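
The selector behaviour described above can be illustrated with a minimal software sketch. It is not the circuit of Figures 6a to 6e: the function names `mux2` and `clock_row`, the per-register `select_new` flags and the wrap-around from the final register to the first are assumptions made purely for illustration.

```python
from typing import List


def mux2(select_new: bool, shifted: int, transformed: int) -> int:
    """Model of a 2-to-1 selector: pass the transformed byte or the shifted byte."""
    return transformed if select_new else shifted


def clock_row(row: List[int], new_byte: int, select_new: List[bool]) -> List[int]:
    """Advance one four-byte shift register (one row) by a single clock cycle.

    row[0] models the 'first' register and row[-1] the 'final' register.
    Register i normally loads from register i-1; register 0 is assumed to
    wrap around from the final register. select_new[i] models a MUX on
    register i that loads the transformed byte instead.
    """
    n = len(row)
    return [mux2(select_new[i], row[(i - 1) % n], new_byte) for i in range(n)]


row = [0x10, 0x11, 0x12, 0x13]
# New byte enters the first register, the remaining bytes shift along.
after_new = clock_row(row, 0xAA, [True, False, False, False])
# No register selects the new byte: the row simply rotates.
after_rotate = clock_row(row, 0xAA, [False, False, False, False])
```

In this model each register sees at most one 2-to-1 selection per cycle, which mirrors the point that the switches sit on the register inputs and not in the path through the transformation module.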

[0056] The operation of the apparatus 100 is now described with reference to Figures 6a to 6e. Initially, the registers
160 are loaded with the plaintext data to be encrypted which, in this case, comprises 16 bytes of data, one byte being
loaded into a respective register 160. For a 128-bit data block, and where the registers 160 are implemented as four
four-byte registers, the data is conveniently shifted into each of the four four-byte registers over four clock cycles - in
each of the four clock cycles, a respective byte will be loaded into each of the four four-byte registers. Loading data
into the registers 160 can be performed in any conventional manner and, in Figures 6a to 6e, loading inputs are not
illustrated for clarity. The plaintext bytes are arranged in the registers 160 in their natural order with respect to one
another i.e., referring to Figures 1a and 6a, bytes P_{0,0}, P_{1,0}, P_{2,0} and P_{3,0} are loaded into the rightmost column of
registers 160 as viewed in Figure 6a, bytes P_{0,1}, P_{1,1}, P_{2,1} and P_{3,1} are loaded into the next adjacent column to the
left, bytes P_{0,2}, P_{1,2}, P_{2,2} and P_{3,2} are loaded into the next adjacent column to the left and bytes P_{0,3}, P_{1,3}, P_{2,3} and
P_{3,3} are loaded into the leftmost column of registers 160.
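
The loading step can be sketched as follows. This is a hedged illustration only: the function name `load_rows` is hypothetical, the mapping of input byte `r + 4*c` to state position (row r, column c) is the usual Rijndael convention and is assumed here, and the choice to shift bytes in from the left with column 0 entering first is an assumption chosen so that column 0 ends up rightmost.

```python
from typing import List, Optional


def load_rows(block: bytes) -> List[List[Optional[int]]]:
    """Shift a 16-byte block into four four-byte row registers over four cycles.

    Each returned row is listed left to right, so index 3 is the rightmost
    register. One byte per row is loaded in each clock cycle.
    """
    if len(block) != 16:
        raise ValueError("expected a 128-bit (16-byte) block")
    rows: List[List[Optional[int]]] = [[None] * 4 for _ in range(4)]
    for col in range(4):                      # one clock cycle per state column
        for r in range(4):
            byte = block[r + 4 * col]         # assumed state mapping P_{r,col}
            rows[r] = [byte] + rows[r][:-1]   # shift in from the left
    return rows


rows = load_rows(bytes(range(16)))
```

With the input bytes 0 to 15, Row 0 ends up holding the bytes of columns 3, 2, 1, 0 from left to right, so that column 0 occupies the rightmost register as in the description.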

[0057] The labelling of the registers 160 in Figures 6a to 6e shows how the bytes in the respective registers are
processed during the round transformation. Figure 6a illustrates the register contents in a first cycle, Cycle 0, in which
the first four bytes to be operated on by the transformation module 156 are bytes labelled 1A, 1B, 1C and 1D and it may be
seen from Figure 6a from which registers 160 these bytes are taken. This arrangement corresponds with the foregoing
description relating to labelling of the operand 52 in Figure 5.
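
As a rough aid to reading the labels, the sketch below derives a label for each state position from the Rijndael ShiftRow offsets for a 128-bit block (row r is rotated by r positions). The function name `schedule_labels` and the assumption that the cycle number equals the index of the shifted column plus one are illustrative and are not taken from the figures. The labels it gives for column 0 of Rows 1 to 3 (4B, 3C and 2D) agree with the bytes that the description places in the final registers of those rows at Cycle 0.

```python
from typing import List


def schedule_labels() -> List[List[str]]:
    """Return labels[r][c] for state row r and column c.

    The digit is the cycle (1 to 4) in which the byte is assumed to be
    processed; the letter A to D identifies the row.
    """
    labels = [[""] * 4 for _ in range(4)]
    for r in range(4):
        for c in range(4):
            # After ShiftRow, the byte at (r, c) lands in column (c - r) mod 4.
            cycle = (c - r) % 4
            labels[r][c] = f"{cycle + 1}{'ABCD'[r]}"
    return labels


labels = schedule_labels()
first_cycle = [labels[r][r] for r in range(4)]  # bytes processed together first
```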

[0058] In the following description of Figures 6b to 6e, for convenience, the contents of the registers 160 are described
on a row-by-row basis using the row number notation Row 0 to Row 3 given in the drawings. It will be understood that
the term 'row' is a relational term and does not necessarily imply a particular spatial arrangement. Each row of data
registers 160 corresponds to a respective shift register which in turn corresponds with a row of the data block (when
considered in state array form) being operated on. Thus, the 'first' register 160 in a given row is the register 160 that
takes the first byte of the corresponding state array row, the 'final' register is the register 160 that takes the final byte,
and so on.

[0059] Figure 6b shows the register contents in a second cycle, Cycle 1. In Row 0 of the registers 160, new byte
(1A) (which was created by the transformation module 156 during Cycle 0 and is available on a first output 164 of the
transformation module 156) is entered into the first register 160 of Row 0. The remaining bytes of Row 0 are shifted
to a respective adjacent register as shown by the arrows. Thus, byte 2A is the next byte to be supplied to the
transformation module 156. In Row 1, M1 is arranged to select new byte (1B) from a second output 166 of the transformation
module 156 for input to the final register of Row 1. M2 is arranged to select byte 4B from the final register of Row 1
and to load this byte into the first register of Row 1. The remaining bytes of Row 1 are shifted to a respective adjacent
register as shown. Thus, byte 2B is the next byte to be supplied to the transformation module 156 from Row 1. In Row
2, M3 is arranged to load new byte (1C) from output 168 of the transformation module 156 into the second register
160 from the right in Row 2. M4 is arranged to select byte 3C from the final register 160 and to load same into the first
register of Row 2. The remaining bytes of Row 2 are shifted to a respective adjacent register as shown. Thus, byte
2C is the next byte to be supplied to the transformation module 156 from Row 2. With respect to Row 3, M5 is arranged
to select the final byte, byte 2D, from the Row 3 registers 160 as the input to the first register 160 of Row 3. The new
byte (1D) from output 170 of the transformation module is entered into the optional register 161. The remaining bytes
of Row 3 are shifted to a respective adjacent register as shown. Thus, byte 2D is the next byte to be supplied to the
transformation module 156 from Row 3.

[0060] Figure 6c shows the register contents in a third cycle, Cycle 2. In Row 0 of the registers 160, new byte (2A)
(which was created by the transformation module 156 during Cycle 1 and is available on a first output 164 of the
transformation module 156) is entered into the first register 160 of Row 0. The remaining bytes of Row 0 are shifted
to a respective adjacent register as shown. Thus, byte 3A is the next byte on which transformation module 156 operates
from Row 0. In Row 1, M1 is arranged to select byte (1B) for input to the final register 160 of Row 1 (i.e. there is no
change to the contents of this register in Cycle 2). M2 is arranged to select new byte (2B) from output 166 and to load
this byte into the first register of Row 1. The remaining bytes of Row 1 are shifted to a respective adjacent register as
shown. Thus, byte 3B is the next byte to be supplied to the transformation module 156. In Row 2, M3 is arranged to
load new byte (2C) from output 168 of the transformation module 156 into the second register 160 from the right in
Row 2. M4 is arranged to select byte 4C from the final register 160 and to load same into the first register of Row 2.
The remaining bytes of Row 2 are shifted to a respective adjacent register as shown. Thus, byte 3C is the next byte
to be supplied to the transformation module 156. With respect to Row 3, M5 is arranged to select the final byte, byte
3D, from the Row 3 registers 160 as the input to the first register 160 of Row 3. The new byte (2D) from output 170 of
the transformation module is entered into the optional register 161. The remaining bytes of Row 3 are shifted to a
respective adjacent register as shown. Thus, the next byte to be supplied to the transformation module 156 from Row
3 is byte 3D.

[0061] Figure 6d shows the register contents in a fourth cycle, Cycle 3. In Row 0 of the registers 160, new byte (3A)
(which was created by the transformation module 156 during Cycle 2 and is available on a first output 164 of the
transformation module 156) is entered into the first register 160 of Row 0. The remaining bytes of Row 0 are shifted
to a respective adjacent register as shown. Thus, byte 4A is the next byte on which transformation module 156 operates
from Row 0. In Row 1, M1 is arranged to select byte (1B) for input to the final register 160 of Row 1 (i.e. there is no
change to the contents of this register in Cycle 3). M2 is arranged to select new byte (3B) from output 166 and to load
this byte into the first register of Row 1. The remaining bytes of Row 1 are shifted to a respective adjacent register as
shown. Thus, byte 4B is the next byte to be supplied to the transformation module 156 from Row 1. In Row 2, M4 is
arranged to load new byte (3C) from output 168 of the transformation module 156 into the first register 160 in Row 2.
M3 is arranged to select byte (1C) from the final register 160. The remaining bytes of Row 2 are shifted to a respective
adjacent register as shown. Thus, byte 4C is the next byte to be supplied to the transformation module 156 from Row
2. With respect to Row 3, M5 is arranged to select the final byte, byte 4D, from the Row 3 registers 160 as the input
to the first register 160 of Row 3. The new byte (3D) from output 170 of the transformation module is entered into the
optional register 161. The remaining bytes of Row 3 are shifted to a respective adjacent register as shown. Thus, the
next byte to be supplied to the transformation module 156 from Row 3 is byte 4D.

[0062] Figure 6e shows the register contents in a fifth cycle, Cycle 4. In Row 0 of the registers 160, new byte (4A)
(which was created by the transformation module 156 during Cycle 3 and is available on a first output 164 of the
transformation module 156) is entered into the first register 160 of Row 0. The remaining bytes of Row 0 are shifted
to a respective adjacent register as shown. Thus, byte (1A) is the next byte on which transformation module 156
operates from Row 0. In Row 1, M1 is arranged to select byte (1B) for input to the final register 160 of Row 1 (i.e. there
is no change to the contents of this register in Cycle 4). M2 is arranged to select new byte (4B) from output 166 and
to load this byte into the first register of Row 1. The remaining bytes of Row 1 are shifted to a respective adjacent register
as shown. Thus, byte (2B) is the next byte to be supplied to the transformation module 156 from Row 1. In Row 2, M4
is arranged to load new byte (4C) from output 168 of the transformation module 156 into the second register 160
from the right in Row 2. M3 is arranged to select byte (2C) from the final register 160. The remaining bytes of Row 2
are shifted to a respective adjacent register as shown. Thus, byte (3C) is the next byte to be supplied to the
transformation module 156. With respect to Row 3, M5 is arranged to select the new byte (4D) from output 170 as the input
to the first register 160 of Row 3. The new byte (4D) from output 170 of the transformation module is also entered into
the optional register 161. The remaining bytes of Row 3 are shifted to a respective adjacent register as shown. Thus,
the next byte to be supplied to the transformation module 156 from Row 3 is byte (4D).

[0063] Thus, each round is performed in four consecutive clock cycles: Cycle 0 to Cycle 1; Cycle 1 to Cycle 2; Cycle
2 to Cycle 3; and Cycle 3 to Cycle 4. Successive rounds may be performed consecutively, wherein the encrypted data
block is comprised of the values contained in the registers 160 after the final round is completed. In this connection, it
is noted that the values of Cycle 4 in one round are the Cycle 0 values of the following round.
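
The four-cycle schedule can be traced for Row 0 with a small sketch. It is a simplification under stated assumptions: `run_row0_round` and `placeholder_transform` are hypothetical names, the final register is assumed to feed the transformation module, and the placeholder merely brackets a label, whereas a real round output byte depends on all four bytes supplied in that cycle.

```python
from typing import Callable, List


def run_row0_round(row: List[str], transform: Callable[[str], str]) -> List[List[str]]:
    """Trace Row 0 over Cycle 0 to Cycle 4.

    row[0] is the first register and row[-1] the final register. In each
    cycle the transformed byte enters the first register and the remaining
    bytes shift to the adjacent register.
    """
    history = [list(row)]
    for _ in range(4):
        new_byte = transform(row[-1])   # assumed tap: the final register
        row = [new_byte] + row[:-1]
        history.append(list(row))
    return history


def placeholder_transform(label: str) -> str:
    """Stand-in for the round transformation: mark a byte as transformed."""
    return f"({label})"


cycles = run_row0_round(["4A", "3A", "2A", "1A"], placeholder_transform)
```

After four cycles every register of the row holds a transformed byte and (1A) is back in the final register, which is why the Cycle 4 contents can serve directly as the Cycle 0 contents of the next round.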

[0064] Conveniently, after the encryption rounds are completed, the data in the registers 160 are passed in 32-bit
blocks to the final round module (Figure 4) after which they may be output over four clock cycles serially in 32-bit blocks.

[0065] In an alternative embodiment (not illustrated), the optional register 161 is removed and shift control (i.e. register
control) is added so that the values in the second, third and fourth registers 160 in Row 3 are not shifted in the last
cycle. However, controlling the loading of a register in this way normally adds a switch or MUX to its input port (unless
the register primitive has load enable control). In the apparatus of Figure 6, this would require an additional three
2-to-1 MUXes in place of register 161 and, in ASIC technology, three 2-to-1 MUXes are normally larger than one
register. Therefore, the embodiment of Figures 6a to 6e is preferred.

[0066] The present invention applies equally to the implementation of data encryption or data decryption rounds and
may therefore be used, for example, in the implementation of the Inverse Round transformation of a Rijndael decryption
apparatus. Figure 7 shows a schematic representation of a data decryption apparatus, generally indicated at 40', for
implementing, in particular, Rijndael decryption. The apparatus 40' is arranged to receive a ciphertext input data block
(shown as "ciphertext" in Figure 7) and an inverse cipher key (shown as "key" in Figure 4) and to produce, after a
number of decryption rounds, a decrypted data block (shown as "plaintext" in Figure 7). The decryption apparatus 40'
is of generally similar design to the encryption apparatus 40 and operates in a similar manner. However, the relative
positions of the data/key addition module 48' and the final round module 46' are reversed in comparison with the data
encryption module 40. Also, the final round module 46' and the transformation module 156' are arranged to implement
the Rijndael inverse final round and inverse normal round respectively. Further, since the Rijndael ShiftRow and Inverse
ShiftRow operations are different, the arrangement of switches, or multiplexers, within the data scheduling apparatus
100' is different (the shift operations performed on Rows 0 and 2 are the same in encryption and decryption. The shift
operation carried out on Row 1 during encryption is equivalent to the inverse shift operation carried out on Row 3 during
decryption. Also, the shift operation carried out on Row 3 during encryption is equivalent to the inverse shift operation
carried out on Row 1 during decryption).
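
The row equivalences stated in the parenthesis can be checked with a short sketch, assuming the standard Rijndael offsets for a 128-bit block (Row r is rotated left by r positions in encryption and right by r positions in decryption). The function names are illustrative.

```python
from typing import List, Sequence


def rotate_left(row: Sequence[int], n: int) -> List[int]:
    """Cyclically rotate a row left by n positions (negative n rotates right)."""
    n %= len(row)
    return list(row[n:]) + list(row[:n])


def shift_row(row: Sequence[int], r: int) -> List[int]:
    """ShiftRow for row r of a four-column state: rotate left by r."""
    return rotate_left(row, r)


def inv_shift_row(row: Sequence[int], r: int) -> List[int]:
    """Inverse ShiftRow for row r: rotate right by r."""
    return rotate_left(row, -r)


sample = [0, 1, 2, 3]
same_on_rows_0_and_2 = all(shift_row(sample, r) == inv_shift_row(sample, r) for r in (0, 2))
row1_enc_equals_row3_dec = shift_row(sample, 1) == inv_shift_row(sample, 3)
row3_enc_equals_row1_dec = shift_row(sample, 3) == inv_shift_row(sample, 1)
```

Because a right rotation by 3 of a four-element row is the same as a left rotation by 1, the Row 1 and Row 3 switch arrangements simply swap between the encryption and decryption schedulers.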

[0067] Figures 8a to 8e illustrate the scheduling apparatus 100' for implementing a data decryption round according
to one aspect of the invention. The inverse round transformation module 156' is also shown in Figures 8a to 8e. As
the scheduling apparatus 100' is generally similar in design to the scheduling apparatus 100, similar reference numerals
are used to indicate like parts. The operation of the scheduling apparatus 100' is now described with reference to
Figures 8a to 8e.

[0068] Figure 8a illustrates the register 160' contents in cycle 0. It will be seen that the first four bytes to be operated
on are 1A, 1B, 1C and 1D.

[0069] Figure 8b illustrates the register contents in cycle 1. In Row 0 of the registers 160', byte 2A is the next byte
on which to be operated. New byte (1A) is entered into the shift register at the beginning of Row 0. In Row 1, M5 selects
the final byte in the register for Row 1, namely byte 2B. New byte (1B) is entered into the optional register 161'. In Row 2,
M3 selects new byte (1C) and M4 selects the final byte in the Row 2 register. In Row 3, M1 selects
new byte (1D) and M2 selects byte 4D from the final register location in Row 3.

[0070] Figure 8c illustrates the register contents in cycle 2. In Row 0, byte 3A is the next byte to be operated on.
New byte (2A) is entered into the first (register) location of the Row 0 shift register. In Row 1, M5 selects the final byte in
the register, byte 3B, and new byte (2B) is entered into register 161'. In Row 2, M3 selects new byte (2C) and M4 selects
the final byte in the Row 2 register, namely byte 4C. In Row 3, M1 selects byte (1D) from the final Row 3 register. M2 selects
new byte (2D).

[0071] Figure 8d illustrates the register contents in cycle 3. In Row 0, byte 4A is the next byte on which to be operated.
New byte (3A) is entered into the first register of Row 0. In Row 1, M5 selects the final byte in the register, byte 4B. New byte
(3B) is entered into register 161'. In Row 2, M3 selects the final byte in the register, byte (1C). M4 selects new byte (3C). In
Row 3, M1 selects the final byte in the register, byte (1D). M2 selects new byte (3D).

[0072] Figure 8e illustrates the register contents in cycle 4. In Row 0, byte (1A) is the next byte on which to be
operated. New byte (4A) is entered into the Row 0 shift register. In Row 1, M5 selects new byte (4B). New byte (4B)
is entered into register 161'. In Row 2, M3 selects the final byte in the register, byte (2C). M4 selects new byte (4C). In Row
3, M1 selects the final byte in the register, byte (1D). M2 selects new byte (4D).

[0073] As before, cycle 4 of one round serves as cycle 0 of the following round. Also, the extra register 161' in Row
1 could be removed and shift control added so that the values in the subsequent registers 160' in Row 1 are not shifted
in the last cycle. However, controlling the loading of a register adds a multiplexer to its input port (unless the register
primitive has load enable control) and three 2-to-1 MUXes are larger than one register in ASIC technology. Thus, the
arrangement shown in Figures 8a to 8e is preferred.

[0074] It will be observed that implementation of the encryption/decryption round in accordance with the invention
removes MUXes, or other switching devices, from the computational data paths when compared with conventional 
arrangements (see, for example, Figure 5). This allows a higher design throughput to be achieved. Moreover, since 
the apparatus 100 uses 2-to-1 switches, which are smaller than the 4-to-1 switches required by the arrangement shown 
in Figure 5, the apparatus 100 is smaller. A hardware gate count comparison between a typical arrangement of the 
type shown in Figure 5 and the apparatus 100 of the invention is provided in Table 1 below. 



Table 1. Hardware Gate Count Comparison between Typical Implementation and Invention.

Target Process              4-to-1 Mux based      Invention
ASIC                        7644 gates*           5701 gates*
Xilinx FPGA (VIRTEX-E)      397 LUTs, 2 BRAMs     258 LUTs, 2 BRAMs
Altera CPLD (APEX20KE)      472 LCs, 4 ESBs       280 LCs, 4 ESBs



* Some timing constraints applied
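
For orientation, the relative savings implied by Table 1 can be worked out as below. The figures are taken from the table; the dictionary keys and the function name `reduction_pct` are illustrative, and the BRAM and ESB counts are omitted because they are the same in both columns.

```python
# (4-to-1 mux based, invention) logic resource counts from Table 1
table_1 = {
    "ASIC gates": (7644, 5701),
    "Virtex-E LUTs": (397, 258),
    "APEX20KE LCs": (472, 280),
}


def reduction_pct(baseline: int, invention: int) -> float:
    """Percentage reduction of the invention relative to the baseline."""
    return 100.0 * (baseline - invention) / baseline


savings = {name: round(reduction_pct(*counts), 1) for name, counts in table_1.items()}
```

On these numbers the reduction is roughly a quarter of the gate count on the ASIC target and between a third and two fifths of the logic cells on the two programmable devices.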



[0075] The foregoing description relates to the implementation of the Rijndael encryption round where the data block
length is 128 bits. It will be understood that the invention is not limited to use in the implementation of the Rijndael
cipher and may be used with similarly structured ciphers. Further, the invention is not limited to use where the data
block length is 128 bits. A skilled person will appreciate that the arrangements of the invention described and illustrated
above may be modified to implement Rijndael when the data block length is 192 or 256 bits. For 192 bits, an additional
two columns of four registers would be required in the apparatus of Figures 6a to 6e, while for 256 bits, an additional
four columns of registers would be required.
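
The register counts for the larger block lengths follow from the four-row state array, as the short sketch below shows. The function name `register_layout` is hypothetical and only the column and register counts stated above are modelled.

```python
from typing import Dict


def register_layout(block_bits: int) -> Dict[str, int]:
    """Columns and byte registers needed for a given Rijndael block length."""
    if block_bits not in (128, 192, 256):
        raise ValueError("block length must be 128, 192 or 256 bits")
    columns = block_bits // 32          # four one-byte rows per column
    return {
        "columns": columns,
        "extra_columns": columns - 4,   # relative to the 128-bit apparatus
        "byte_registers": 4 * columns,
    }


layout_192 = register_layout(192)
layout_256 = register_layout(256)
```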

[0076] The preferred implementation of the invention is on FPGA. However, the apparatus of the invention may 
alternatively be implemented on other conventional devices such as other Programmable Logic Devices (PLDs) or an 
ASIC (Application Specific Integrated Circuit). 

[0077] The invention is not limited to the embodiments described herein which may be modified or varied without
departing from the scope of the invention.

Claims 

1. An apparatus for encrypting or decrypting a data block comprising a plurality of data components over a plurality
of operational cycles, the apparatus comprising a transformation module arranged to perform one or more encryption
or decryption operations in each operational cycle; and a plurality of shift registers each comprising a sequence
of data registers through which data components are shifted in successive operational cycles, the transformation
module being arranged to receive a respective data component from a respective data register from each shift
register and to operate on each of the received data components to produce corresponding transformed data
components, wherein at least some of said data registers are associated with a respective selector switch, the
setting of which selector switch in each operational cycle determines whether the associated data register is loaded
with a data component from a data register in its respective shift register or with the transformed data component
corresponding to its respective shift register in said operational cycle.

2. An apparatus as claimed in Claim 1, wherein the apparatus is arranged to perform encryption or decryption in
accordance with the Rijndael cipher.

3. An apparatus as claimed in Claim 2, wherein the transformation module is arranged to perform, in whole or in part, 
a Rijndael encryption or decryption round. 

4. An apparatus as claimed in any preceding claim, wherein the apparatus is arranged to operate on data blocks
comprising sixteen data components, each component comprising one data byte, wherein each shift register
comprises four one-byte data registers.

5. An apparatus as claimed in Claim 4 when dependent on Claim 3, wherein the transformation module is arranged 
to perform one quarter of the Rijndael encryption or decryption round. 

6. An apparatus as claimed in any preceding claim, in which each switch comprises a 2-to-1 selector switch. 

7. An apparatus as claimed in any of claims 2 to 6, wherein the apparatus is arranged to perform encryption and
further includes: a data/key addition module arranged to receive a data block to be encrypted and a cipher key,
the data/key addition module being arranged to perform an addition operation on the received data block and key
and to cause the result of the addition operation to be loaded into the shift registers; a key scheduling module 
arranged to generate sub-keys from the cipher key and to provide the sub-keys to the transformation module; and 
a module arranged to implement the Rijndael final encryption round, the final encryption round module being 
arranged to receive transformed data components from the data registers. 

8. An apparatus as claimed in any of claims 2 to 6, wherein the apparatus is arranged to perform decryption and 
further includes: a module arranged to receive a data block to be decrypted, to implement the Rijndael
inverse final encryption round on the received data block and to cause the result to be loaded into the shift registers; 
a key scheduling module arranged to receive a cipher key, to generate sub-keys from the cipher key and to provide 
the sub-keys to the transformation module; and a data/key addition module arranged to receive transformed data 
components from the data registers. 

9. A method of encrypting or decrypting a data block, comprising a plurality of data components, over a plurality of
operational cycles, the method comprising: loading the data components into a respective data register, each data
register being one of a sequence of data registers in one of a plurality of shift registers; and in respect of each
operational cycle, causing a data component from a respective data register of each shift register to undergo one
or more data encryption or decryption operations to produce a corresponding transformed data component; and
setting at least one selector switch to determine whether an associated data register is loaded with a data
component from a data register in its respective shift register or with the transformed data component corresponding
to its respective shift register.

10. A computer program product comprising computer usable instructions for generating an apparatus according to 
Claim 1. 